Abstract

Block coordinate descent methods and stochastic subgradient methods have been extensively
studied in optimization and machine learning. By combining randomized block sampling with
stochastic subgradient methods based on dual averaging [22, 36], we present stochastic block
dual averaging (SBDA), a novel class of block subgradient methods for convex nonsmooth and
stochastic optimization. SBDA requires only a block of subgradients and updates blocks of
variables, and hence has significantly lower iteration cost than traditional subgradient methods. We
show that the SBDA-based methods exhibit the optimal convergence rate for convex nonsmooth
stochastic optimization. More importantly, we introduce randomized stepsize rules and block
sampling schemes that are adaptive to the block structures, which significantly improves the
convergence rate w.r.t. the problem parameters. This is in sharp contrast to recent block
subgradient methods applied to nonsmooth deterministic or stochastic optimization (El EH). For
strongly convex objectives, we propose a new averaging scheme to make the regularized dual
averaging method optimal, without having to resort to any accelerated schemes.


1 Introduction 

In this paper, we mainly focus on the following convex optimization problem:

\[ \min_{x \in X} \phi(x), \tag{1} \]

where the feasible set $X$ is embedded in Euclidean space $\mathbb{R}^N$ for some integer $N > 0$. Letting
$N_1, N_2, \ldots, N_n$ be $n$ positive integers such that $\sum_{i=1}^n N_i = N$, we assume $X$ can be partitioned as
$X = X_1 \times X_2 \times \cdots \times X_n$, where each $X_i \subseteq \mathbb{R}^{N_i}$. We denote $x \in X$ by $x = x^{(1)} \times x^{(2)} \times \cdots \times x^{(n)}$, where
$x^{(i)} \in X_i$. The objective $\phi(x)$ consists of two parts: $\phi(x) = f(x) + \omega(x)$. We stress that both $f(x)$
and $\omega(x)$ can be nonsmooth. $\omega(x)$ is a convex function with block separable structure: $\omega(x) =
\sum_{i=1}^n \omega_i(x^{(i)})$, where each $\omega_i : X_i \to \mathbb{R}$ is convex and relatively simple. In composite optimization or
regularized learning, the term $\omega(x)$ imposes solutions with certain preferred structures. Common
examples of $\omega(\cdot)$ include the $\ell_1$ norm or squared $\ell_2$ norm regularizers. $f(x)$ is a general convex
function. In many important statistical learning problems, $f(x)$ has the form $f(x) = \mathbb{E}_\xi[F(x, \xi)]$,
where $F(x, \xi)$ is a convex loss function of $x \in X$ with $\xi$ representing sampled data. When it is
difficult to evaluate $f(x)$ exactly, as in batch learning or sample average approximation (SAA), $f(x)$
is approximated with finite data. Firstly, a large number of samples $\{\xi_j\}_{j=1}^m$ are drawn, and
then $f(x)$ is approximated by $\tilde f(x) = \frac{1}{m}\sum_{j=1}^m F(x, \xi_j)$, with the alternative problem:

\[ \min_{x \in X} \tilde\phi(x) := \tilde f(x) + \omega(x). \tag{2} \]

However, although classic first order methods can provide accurate solutions to (2), the major
drawback of these approaches is their poor scalability to large data. First order deterministic methods
require full information of the (sub)gradient and scan through the entire dataset many times, which
is prohibitive for applications where scalability is paramount. In addition, due to the statistical
nature of the problem, solutions with high precision may not even be necessary.

To solve the aforementioned problems, stochastic methods, namely stochastic (sub)gradient descent
(SGD) and block coordinate descent (BCD), have received considerable attention in the machine
learning community. Both of them confer new advantages in the trade-offs between speed and
accuracy. Compared to deterministic and full (sub)gradient methods, they are easier to implement,
have much lower computational complexity in each iteration, and often exhibit sufficiently fast
convergence while obtaining practically good solutions.

SGD was first studied in [29] in the 1950s, with the emphasis mainly on solving strongly convex
problems; specifically, it only needs the gradient/subgradient on a few data samples while iteratively
updating all the variables. In the approach of online learning or stochastic approximation (SA),
SGD directly works on the objective (1), and obtains convergence independent of the sample size.
While early work emphasizes asymptotic properties, recent work investigates complexity analysis
of convergence. Many works ([m Hal Eai EH n m E]) investigate the optimal SGD under various
conditions. Proximal versions of SGD, which explicitly incorporate the regularizer $\omega(x)$, have been
studied, for example in [laiiEiiM].

The study of BCD also has a long history. BCD was initiated in [18, 19], but the application of
BCD to linear systems dates back even earlier (for example, see the Gauss-Seidel method in jH).
It works on the approximated problem (2) and makes progress by reducing the original problem
into subproblems using only a single block coordinate of the variable at a time. Recent works
|23l EH EH HU study BCD with random sampling (RBCD) and obtain non-asymptotic complexity
rates. For the regularized learning problem as in (2), RBCD on the dual formulation has been
proposed [3I1HI1E2]. Although most of the work on BCD focuses on smooth (composite) objectives,
some recent work (H EH EH EH) seeks to extend the realm of BCD in various ways. The works
in [MIE] discuss (block) subgradient methods for nonsmooth optimization. Combining the ideas of
SGD and BCD, the works in [H E3 EH EH EZ] employ sampling of both features and data instances
in BCD.

In this paper, we propose a new class of block subgradient methods, namely, stochastic block
dual averaging (SBDA), for solving nonsmooth deterministic and stochastic optimization problems.
Specifically, SBDA consists of a new dual averaging step incorporating the average of all
past (stochastic) block subgradients and variable updates involving only block components. We
bring together two strands of research, namely, the dual averaging algorithm (DA) [36, 22], which
was studied for nonsmooth optimization, and randomized coordinate descent (RCD) [23], employed
for smooth deterministic problems. Our main contributions consist of the following:

• Two types of SBDA have been proposed for different purposes. For regularized learning, 
we propose SBDA-u which performs uniform random sampling of blocks. For more general 
nonsmooth learning problems, we propose SBDA-r which applies an optimal sampling scheme 
with improved convergence. Compared with existing subgradient methods for nonsmooth and 
stochastic optimization, both SBDA-u and SBDA-r have significantly lower iteration cost when 
the computation of block subgradients and block updates are convenient. 

• We contribute a novel scheme of randomized stepsizes and optimized sampling strategies which
are truly adaptive to the block structures. Selecting block-wise stepsizes and optimal block
sampling have been critical issues for speeding up BCD for smooth regularized problems;
see [231 [251EU El] for some recent advances. For nonsmooth or stochastic optimization,
the works most closely related to ours are la ED, which do not apply block-wise stepsizes.
To the best of our knowledge, this is the first time block subgradient methods with block
adaptive stepsizes and optimized sampling have been proposed for nonsmooth and stochastic
optimization.

• We provide new theoretical guarantees of convergence of SBDA methods. SBDA obtains the
optimal rate of convergence for general convex problems, matching the state-of-the-art results
in the literature of stochastic approximation and online learning. More importantly, SBDA
exhibits a significantly improved convergence rate w.r.t. the problem parameters. When the
regularizer $\omega(x)$ is strongly convex, our analysis provides a simple way to make the regularized
dual averaging methods in [36] optimal. We show that an aggressive weighting is sufficient to obtain
$O(1/T)$ convergence, where $T$ is the iteration count, without the need for any accelerated schemes.
This appears to be a new result for simple dual averaging methods.

Related work. Extending BCD to the realm of nonsmooth and stochastic optimization has been of
interest lately. Efficient subgradient methods for a class of nonsmooth problems have been proposed
in [24]. However, to compute the stepsize, the block version of this subgradient method requires
computation of the entire subgradient and knowledge of the optimal value; hence, it may not be
efficient in a more general setting. The methods in |3l E3] employ stepsizes that are not adaptive to
the block selection and therefore have suboptimal bounds compared to our work. For SA or online learning,
SBDA applies double sampling of both blocks and data. A similar approach has also been employed
for new stochastic methods in some very recent work ([3l [39l EH E3 E2]). It should be noted here
that if the assumptions are strengthened, namely, in the batch learning formulation, and if the objective is
smooth, it is possible to obtain a linear convergence rate $O(e^{-T})$. Nesterov's randomized block
coordinate methods [23, 28] consider different stepsize rules and block sampling, but only for smooth
objectives with possible nonsmooth regularizers. Recently, nonuniform sampling in BCD has been
addressed in |25| [38] [TH] and shown to have advantages over uniform sampling. Although our work
discusses block-wise stepsizes and nonuniform sampling as well, we stress the nonsmooth objectives
that appear in deterministic and stochastic optimization. The proposed algorithms employ very
different proof techniques, thereby obtaining different optimized sampling distributions.


Outline of the results. 


We introduce two versions of SBDA that are appropriate in different contexts. The first algorithm,
SBDA with uniform block sampling (SBDA-u), works for a class of convex composite functions,
namely, those for which $\omega(x)$ is explicit in the proximal step. When $\omega(x)$ is a general convex function,
for example, the sparsity regularizer $\|x\|_1$, we show that SBDA-u obtains the convergence rate of
$O\big(\sqrt{n}\sum_{i=1}^n \sqrt{M_i^2 D_i}/\sqrt{T}\big)$, which improves the rate of $O\big(\sqrt{n}\sqrt{\sum_{i=1}^n M_i^2}\sqrt{\sum_{i=1}^n D_i}/\sqrt{T}\big)$ by SBMD. Here $\{M_i\}$ and
$\{D_i\}$ are some parameters associated with the blocks of coordinates, to be specified later. When
$\omega(x)$ is a strongly convex function, by using a more aggressive scheme to be specified later, SBDA-u
obtains the optimal rate of $O\big(n\sum_{i=1}^n M_i^2/(\lambda T)\big)$, matching the result from SBMD. In addition, for general
convex problems in which $\omega(x) = 0$, we propose a variant of SBDA with nonuniform random
sampling (SBDA-r), which achieves an improved convergence rate $O\big((\sum_{i=1}^n M_i^{2/3} D_i^{1/3})^{3/2}/\sqrt{T}\big)$. These
computational results are summarized in Table 1.

Algorithm   Objective                    Complexity
SBDA-u      Convex composite             $O\big(\sqrt{n}\sum_{i=1}^n \sqrt{M_i^2 D_i}/\sqrt{T}\big)$
SBDA-u      Strongly convex composite    $O\big(n\sum_{i=1}^n M_i^2/(\lambda T)\big)$
SBDA-r      Convex nonsmooth             $O\big((\sum_{i=1}^n M_i^{2/3} D_i^{1/3})^{3/2}/\sqrt{T}\big)$

Table 1: Iteration complexity of our SBDA algorithms.

Structure of the paper. The paper proceeds as follows. Section 2 introduces the notation used
in this paper. Section 3 presents and analyzes SBDA-u. Section 4 presents SBDA-r, and discusses
optimal sampling and its convergence. Experimental results demonstrating the performance of
SBDA are provided in Section 6. Section 7 draws conclusions and comments on possible future
directions.

2 Preliminaries 

Let $\mathbb{R}^N$ be a Euclidean vector space, and let $N_1, N_2, \ldots, N_n$ be $n$ positive integers such that $N_1 + \cdots + N_n = N$.
Let $I$ be the identity matrix in $\mathbb{R}^{N \times N}$, and $U_i$ be an $N \times N_i$ matrix such that

\[ I = [U_1\ U_2\ \ldots\ U_n]. \]

For each $x \in \mathbb{R}^N$, we have the decomposition $x = U_1 x^{(1)} + U_2 x^{(2)} + \cdots + U_n x^{(n)}$, where $x^{(i)} \in \mathbb{R}^{N_i}$.

Let $\|\cdot\|_{(i)}$ denote the norm on $\mathbb{R}^{N_i}$, and $\|\cdot\|_{(i),*}$ be the induced dual norm. We define the
norm $\|\cdot\|$ in $\mathbb{R}^N$ by $\|x\|^2 = \sum_{i=1}^n \|x^{(i)}\|_{(i)}^2$ and its dual norm $\|\cdot\|_*$ by $\|x\|_*^2 = \sum_{i=1}^n \|x^{(i)}\|_{(i),*}^2$.

Let $d_i : X_i \to \mathbb{R}$ be a distance transform function with modulus $\rho$ with respect to $\|\cdot\|_{(i)}$. $d_i(\cdot)$
is continuously differentiable and strongly convex:

\[ d_i(\alpha x + (1-\alpha) y) \le \alpha d_i(x) + (1-\alpha) d_i(y) - \tfrac{1}{2}\rho\,\alpha(1-\alpha)\|x - y\|_{(i)}^2, \quad x, y \in X_i, \]

$i = 1, 2, \ldots, n$.

Let us assume there exists a solution $x^* \in X$ to the problem (1), and

\[ d_i(x^{*(i)}) \le D_i < \infty, \quad i = 1, 2, \ldots, n. \tag{3} \]

Without loss of generality, we assume $d_i(\cdot)$ is nonnegative, and write

\[ d(x) = \sum_{i=1}^n d_i(x^{(i)}) \tag{4} \]

for simplicity. Furthermore, we define the Bregman divergence associated with $d_i(\cdot)$ by

\[ V_i(z, x) = d_i(x) - d_i(z) - \langle \nabla d_i(z), x - z \rangle, \quad z, x \in X_i, \]

and $V(z, x) = \sum_{i=1}^n V_i(z^{(i)}, x^{(i)})$.
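As a small illustration of the Bregman divergence defined above, the following sketch evaluates $V(z, x)$ for a single block under the assumption (ours, for illustration only) that the distance transform function is the Euclidean one, $d(x) = \frac{1}{2}\|x\|_2^2$, for which $\rho = 1$ and $V(z, x)$ reduces to $\frac{1}{2}\|x - z\|_2^2$. The function names are hypothetical.

```python
def bregman(d, grad_d, z, x):
    """Bregman divergence V(z, x) = d(x) - d(z) - <grad d(z), x - z>."""
    inner = sum(g * (xi - zi) for g, xi, zi in zip(grad_d(z), x, z))
    return d(x) - d(z) - inner


def d_euclid(x):
    # Assumed distance transform function: d(x) = 0.5 * ||x||_2^2 (modulus rho = 1).
    return 0.5 * sum(v * v for v in x)


def grad_d_euclid(x):
    # Gradient of 0.5 * ||x||_2^2 is x itself.
    return list(x)


z = [1.0, -2.0, 0.5]
x = [0.0, 1.0, 2.0]
v = bregman(d_euclid, grad_d_euclid, z, x)
# For the Euclidean choice, V(z, x) coincides with half the squared distance.
half_sq_dist = 0.5 * sum((xi - zi) ** 2 for xi, zi in zip(x, z))
```

Other strongly convex choices of $d_i$ (for example an entropy on the simplex) would plug into the same `bregman` helper by supplying a different `d` and `grad_d`.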
We denote $f(x) = \mathbb{E}_\xi[F(x, \xi)]$, and let $G(x, \xi)$ be a subgradient of $F(x, \xi)$, and $g(x) =
\mathbb{E}_\xi[G(x, \xi)] \in \partial f(x)$ be a subgradient of $f(x)$. Let $g^{(i)}(\cdot)$, $G^{(i)}(x, \xi)$ denote their $i$-th block
components, for $i = 1, 2, \ldots, n$. Throughout the paper, we assume the (stochastic) block coordinate
subgradient of $f$ satisfies:

\[ \|g^{(i)}(x)\|_{(i),*}^2 = \big\|\mathbb{E}[G^{(i)}(x, \xi)]\big\|_{(i),*}^2 \le \mathbb{E}\big[\|G^{(i)}(x, \xi)\|_{(i),*}^2\big] \le M_i^2, \quad \forall x \in X, \tag{5} \]

for $i = 1, 2, \ldots, n$. Note that although we make the assumption of a stochastic objective, the following
analysis and conclusions naturally extend to deterministic optimization. To see that, we can simply
assume $g(x) \equiv G(x, \xi)$ and $f(x) \equiv F(x, \xi)$ for any $\xi$.

Before introducing the main convergence properties, we first summarize several useful results in
the following lemmas. Lemmas 1, 2 and 3 slightly generalize the results in [34, 14], [22], and [13]
respectively; their proofs are left to the Appendix.

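Lemma 1. Let $f(\cdot)$ be a lower semicontinuous convex function and $d(\cdot)$ be defined by (4). If

\[ z = \arg\min_{x} \Psi(x) := f(x) + d(x), \]

then

\[ \Psi(x) \ge \Psi(z) + V(z, x), \quad \forall x \in X. \]

Moreover, if $f(x)$ is $\lambda$-strongly convex with norm $\|\cdot\|_{(i)}$, and $x = z + U_i y \in X$ where $y \in X_i$, $z \in X$,
then

\[ \Psi(x) \ge \Psi(z) + V(z, x) + \frac{\lambda}{2}\|y\|_{(i)}^2, \quad \forall x \in X. \]

Lemma 2. Let $\Psi : X \to \mathbb{R}$ be convex, block separable, and $\rho_i$-strongly convex with modulus $\rho_i$ w.r.t.
$\|\cdot\|_{(i)}$, $\rho_i > 0$, $1 \le i \le n$, and $g \in \mathbb{R}^N$. If

\[ x_0 \in \arg\min_{x \in X} \{\Psi(x)\}, \quad \text{and} \quad z = \arg\min_{x \in X} \{\langle U_i g^{(i)}, x\rangle + \Psi(x)\}, \]

then

\[ \langle U_i g^{(i)}, x_0\rangle + \Psi(x_0) \le \langle U_i g^{(i)}, z\rangle + \Psi(z) + \frac{1}{2\rho_i}\|g^{(i)}\|_{(i),*}^2. \]

Lemma 3. If $f$ satisfies the assumption (5), let $x = z + U_i y \in X$ where $y \in X_i$, $z \in X$, then

\[ f(z) \le f(x) + \langle g^{(i)}(x), y\rangle + 2M_i\|y\|_{(i)}. \tag{6} \]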


3 Uniformly randomized SBDA (SBDA-u) 


In this section, we describe uniformly randomized SBDA (SBDA-u) for the composite problem (1).
We consider the formulation proposed in [36], since it incorporates the regularizers for composite
problems. The main update of the DA algorithm has the form

\[ x_{t+1} = \arg\min_{x \in X} \Big\{ \sum_{s=1}^{t} \langle G_s, x\rangle + t\,\omega(x) + \beta_t d(x) \Big\}, \tag{7} \]

where $\{\beta_t\}$ is a parameter sequence, $G_s$ is shorthand for $G(x_s, \xi_s)$, and $d(x)$ is a strongly convex
proximal function. When $\omega(x) = 0$, this reduces to a version of Nesterov's primal-dual subgradient
method [22].
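To make the DA update (7) concrete, the sketch below works out one step in closed form under assumptions that are ours and not the paper's: $X = \mathbb{R}^N$ (no constraints), $\omega(x) = \lambda_1\|x\|_1$, and $d(x) = \frac{1}{2}\|x\|_2^2$. In that special case the minimization separates over coordinates and reduces to soft-thresholding of the accumulated subgradient. The function name `da_step_l1` and the parameter names are hypothetical.

```python
import math


def da_step_l1(grad_sum, t, lam1, beta_t):
    """One dual averaging step for omega = lam1 * ||.||_1 and d = 0.5 * ||.||_2^2.

    Minimizes  <grad_sum, x> + t * lam1 * ||x||_1 + 0.5 * beta_t * ||x||_2^2
    coordinate by coordinate (assumes X is the whole space).
    """
    tau = t * lam1  # total l1 weight accumulated after t steps
    x_next = []
    for s in grad_sum:
        shrunk = max(abs(s) - tau, 0.0)          # soft-threshold magnitude
        x_next.append(-math.copysign(shrunk, s) / beta_t)
    return x_next


# Accumulated subgradients after t = 2 steps (illustrative numbers only).
x_new = da_step_l1([3.0, -0.5, -4.0], t=2, lam1=0.5, beta_t=2.0)
```

Coordinates whose accumulated subgradient stays below the threshold $t\lambda_1$ in magnitude are set exactly to zero, which is the sparsity-inducing behaviour usually attributed to regularized dual averaging.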
Let $\bar G = \sum_{s=0}^{t} \alpha_s U_{i_s} G^{(i_s)}(x_s, \xi_s)$, where $\{\alpha_t\}$ is a sequence of positive values and $\{i_t\}$ is a sequence
of sampled indices. The main iteration step of SBDA has the form

\[ x_{t+1}^{(i_t)} = \arg\min_{x \in X_{i_t}} \Big\{ \langle \bar G^{(i_t)}, x\rangle + l_t^{(i_t)}\omega_{i_t}(x) + \gamma_t^{(i_t)} d_{i_t}(x) \Big\}, \tag{8} \]

and $x_{t+1}^{(j)} = x_t^{(j)}$, $j \ne i_t$.

We highlight two important aspects of the proposed iteration (8). Firstly, the update in (8)
incorporates the past randomly sampled block (stochastic) subgradients $\{G^{(i_t)}(x_t, \xi_t)\}$, rather than
the full (stochastic) subgradients. Meanwhile, the update of the primal variable is restricted to
the same block $(i_t)$, leaving the other blocks untouched. Such block decomposition significantly
reduces the iteration cost of the dual averaging method when the block-wise operation is convenient.
Secondly, (8) employs a novel randomized stepsize sequence $\{\gamma_t\}$ where $\gamma_t \in \mathbb{R}^n$. More specifically,
$\gamma_t$ depends not only on the iteration count $t$, but also on the block index $i_t$. $\{\gamma_t\}$ satisfies the
assumption

\[ \gamma_t^{(j)} = \gamma_{t-1}^{(j)},\ j \ne i_t, \quad \text{and} \quad \gamma_t^{(j)} \ge \gamma_{t-1}^{(j)},\ j = i_t. \tag{9} \]

The most important aspect of (9) is that stepsizes can be specified for each block of coordinates,
thereby allowing for aggressive descent. As will be shown later, the rate of convergence, in terms
of the problem parameters, can be significantly reduced by properly choosing these control parameters.
In addition, we allow the sequence $\{\alpha_t\}$ and the related $\{l_t\}$ to be variable, and hence offer the
opportunity of different averaging schemes in composite settings. To summarize, the full SBDA-u is
described in Algorithm 1.


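Input: convex composite function $\phi(x) = f(x) + \omega(x)$, a sequence of samples $\{\xi_t\}$;
initialize $\alpha_{-1} \in \mathbb{R}$, $\gamma_{-1} \in \mathbb{R}^n$, $l_{-1} = 0^n$, $\bar G = 0^N$, $x_0 = \arg\min_{x \in X} \sum_{i=1}^n \gamma_{-1}^{(i)} d_i(x^{(i)})$;
for $t = 0, 1, \ldots, T - 1$ do
    sample a block $i_t \in \{1, 2, \ldots, n\}$ with uniform probability $1/n$;
    set $\gamma_t^{(i)}$, $i = 1, 2, \ldots, n$;
    set $l_t^{(i_t)} = l_{t-1}^{(i_t)} + \alpha_t$ and $l_t^{(j)} = l_{t-1}^{(j)}$ for $j \ne i_t$;
    update $\bar G$: $\bar G = \bar G + \alpha_t U_{i_t} G^{(i_t)}(x_t, \xi_t)$;
    update $x_{t+1}^{(i_t)} = \arg\min_{x \in X_{i_t}} \big\{ \langle \bar G^{(i_t)}, x\rangle + l_t^{(i_t)}\omega_{i_t}(x) + \gamma_t^{(i_t)} d_{i_t}(x) \big\}$;
    set $x_{t+1}^{(j)} = x_t^{(j)}$ for $j \ne i_t$;
end
Output: $\bar x = \Big[\sum_{t=1}^{T} \big(\alpha_{t-1} - \tfrac{n-1}{n}\alpha_t\big) x_t\Big] \Big/ \Big[\sum_{t=1}^{T} \big(\alpha_{t-1} - \tfrac{n-1}{n}\alpha_t\big)\Big]$;

Algorithm 1: Uniformly randomized stochastic block dual averaging (SBDA-u) method.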
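The loop of Algorithm 1 can be sketched in a few lines for one concrete instance. The sketch below is only an illustration under assumptions of ours: $X = \mathbb{R}^N$, $d_i = \frac{1}{2}\|\cdot\|_2^2$, $\omega_i = \lambda_1\|\cdot\|_1$, $\alpha_t \equiv 1$ (so the output is the plain average of $x_1, \ldots, x_T$), and the block stepsize $\gamma_t^{(i_t)} = u_i\sqrt{t+1}$. With these choices the block update has a closed form via soft-thresholding. The names `sbda_u`, `subgrad_block` and the toy objective are hypothetical.

```python
import math
import random


def soft(v, tau):
    """Soft-thresholding: sign(v) * max(|v| - tau, 0)."""
    return math.copysign(max(abs(v) - tau, 0.0), v)


def sbda_u(subgrad_block, blocks, dim, T, lam1, u, seed=0):
    """Sketch of SBDA-u with Euclidean d_i, l1 regularizer and alpha_t = 1."""
    rng = random.Random(seed)
    n = len(blocks)
    x = [0.0] * dim        # x_0 = argmin of sum_i gamma_{-1}^{(i)} d_i, i.e. the origin
    g_bar = [0.0] * dim    # accumulated block subgradients (G-bar in the text)
    l = [0.0] * n          # l_t^{(i)}: accumulated weight of the regularizer per block
    gamma = list(u)        # gamma_{-1}^{(i)} = u_i > 0 (an assumed initialization)
    avg = [0.0] * dim
    for t in range(T):
        i = rng.randrange(n)                     # uniform block sampling
        gamma[i] = u[i] * math.sqrt(t + 1)       # stepsize changes only on block i_t
        l[i] += 1.0                              # alpha_t = 1
        g = subgrad_block(x, i, rng)             # block subgradient at x_t
        for j, gj in zip(blocks[i], g):
            g_bar[j] += gj
        for j in blocks[i]:                      # closed-form block prox step
            x[j] = -soft(g_bar[j], l[i] * lam1) / gamma[i]
        for j in range(dim):                     # running average of x_1, ..., x_T
            avg[j] += x[j] / T
    return avg


# Toy problem (assumed): f(x) = ||x - c||_1, omega(x) = 0.1 * ||x||_1, two blocks.
c = [1.0, -2.0, 3.0, 0.5]
blocks = [[0, 1], [2, 3]]


def subgrad_block(x, i, rng):
    return [float((x[j] > c[j]) - (x[j] < c[j])) for j in blocks[i]]


def phi(x):
    return sum(abs(a - b) for a, b in zip(x, c)) + 0.1 * sum(abs(a) for a in x)


x_bar = sbda_u(subgrad_block, blocks, dim=4, T=2000, lam1=0.1, u=[1.0, 1.0])
```

Each iteration touches only the coordinates of the sampled block, apart from the running average, which is kept dense here purely for readability.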


The following theorem illustrates an important relation to analyze the convergence of SBDA-u.
Throughout the analysis we assume the simple function $\omega(x)$ is $\lambda$-strongly convex with modulus $\lambda$,
where $\lambda \ge 0$.

Theorem 4. In Algorithm 1, if the sequence $\{\gamma_t\}$ satisfies the assumption (9), then for any $x \in X$,
we have

\[ \sum_{t=1}^{T} \Big(\alpha_{t-1} - \frac{n-1}{n}\alpha_t\Big) \mathbb{E}[\phi(x_t) - \phi(x)] \le \frac{n-1}{n}\alpha_0[\phi(x_0) - \phi(x)] + \sum_{i=1}^{n} \mathbb{E}\big[\gamma_{T-1}^{(i)}\big] d_i(x) + \sum_{i=1}^{n} \frac{5M_i^2}{n} \sum_{t=0}^{T-1} \mathbb{E}\bigg[\frac{\alpha_t^2}{\gamma_{t-1}^{(i)}\rho + l_{t-1}^{(i)}\lambda}\bigg]. \tag{10} \]


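Proof. Firstly, to simplify the notation, when there is no ambiguity, we use the terms $\omega_i(x)$ and
$\omega_i(x^{(i)})$, $d_i(x)$ and $d_i(x^{(i)})$, $V_i(x, y)$ and $V_i(x^{(i)}, y^{(i)})$ interchangeably. In addition, we denote
$\omega_{i^c}(x) = \omega(x) - \omega_i(x)$, and an auxiliary function by

\[ \Psi_t(x) = \begin{cases} \sum_{s=0}^{t} \alpha_s\big[F(x_s, \xi_s) + \langle G_s, x - x_s\rangle_{(i_s)} + \omega_{i_s}(x)\big] + \sum_{i=1}^{n} \gamma_t^{(i)} d_i(x^{(i)}), & t \ge 0, \\ \sum_{i=1}^{n} \gamma_{-1}^{(i)} d_i(x^{(i)}), & t = -1. \end{cases} \tag{11} \]

It can be easily seen from the definition that $x_{t+1}$ is the minimizer of the problem $\min_{x \in X} \Psi_t(x)$.
Moreover, by the assumption on $\{\gamma_t\}$, we obtain

\[ \Psi_t(x) - \Psi_{t-1}(x) \ge \alpha_t\big[F(x_t, \xi_t) + \langle G_t, x - x_t\rangle_{(i_t)} + \omega_{i_t}(x)\big], \quad t = 0, 1, 2, \ldots \tag{12} \]

Applying Lemma 3 and the property (12) at $x = x_{t+1}$, we have

\begin{align*}
\phi(x_{t+1}) &\le f(x_t) + \langle g_t, x_{t+1} - x_t\rangle_{(i_t)} + 2M_{i_t}\|x_{t+1} - x_t\|_{(i_t)} + \omega(x_{t+1}) \\
&= F(x_t, \xi_t) + \langle G_t, x_{t+1} - x_t\rangle_{(i_t)} + 2M_{i_t}\|x_{t+1} - x_t\|_{(i_t)} \\
&\quad + f(x_t) - F(x_t, \xi_t) + \langle g_t - G_t, x_{t+1} - x_t\rangle_{(i_t)} + \omega(x_{t+1}) \\
&\le \frac{1}{\alpha_t}\underbrace{\Big[\Psi_t(x_{t+1}) - \Psi_{t-1}(x_{t+1}) + \frac{\gamma_{t-1}^{(i_t)}\rho + l_{t-1}^{(i_t)}\lambda}{2}\|x_{t+1} - x_t\|_{(i_t)}^2\Big]}_{\Delta_1} \\
&\quad + f(x_t) - F(x_t, \xi_t) + \omega_{i_t^c}(x_{t+1}) \\
&\quad + \underbrace{\langle g_t - G_t, x_{t+1} - x_t\rangle_{(i_t)} - \frac{\gamma_{t-1}^{(i_t)}\rho + l_{t-1}^{(i_t)}\lambda}{2\alpha_t}\|x_{t+1} - x_t\|_{(i_t)}^2 + 2M_{i_t}\|x_{t+1} - x_t\|_{(i_t)}}_{\Delta_2}.
\end{align*}

We proceed with the analysis by separately taking care of $\Delta_1$ and $\Delta_2$. We first provide a concrete
bound on $\Delta_1$. Applying Lemma 1 for $\Psi = \Psi_{t-1}$, with $x_t$ being the optimal point and $x = x_{t+1}$, we obtain

\[ \Psi_{t-1}(x_{t+1}) \ge \Psi_{t-1}(x_t) + \sum_{i=1}^{n} \gamma_{t-1}^{(i)} V_i(x_t, x_{t+1}) + \frac{l_{t-1}^{(i_t)}\lambda}{2}\|x_t - x_{t+1}\|_{(i_t)}^2. \tag{13} \]

In view of (13) and the assumption $V_i(x_t, x_{t+1}) \ge \frac{\rho}{2}\|x_{t+1} - x_t\|_{(i)}^2$, we obtain an upper bound on
$\Delta_1$: $\Delta_1 \le \Psi_t(x_{t+1}) - \Psi_{t-1}(x_t)$. On the other hand, from the Cauchy-Schwarz inequality, we have
$\langle g_t - G_t, x_{t+1} - x_t\rangle_{(i_t)} \le \|g_t - G_t\|_{(i_t),*} \cdot \|x_{t+1} - x_t\|_{(i_t)}$. Then

\[ \Delta_2 \le \|x_{t+1} - x_t\|_{(i_t)} \cdot \big(\|g_t - G_t\|_{(i_t),*} + 2M_{i_t}\big) - \frac{\gamma_{t-1}^{(i_t)}\rho + l_{t-1}^{(i_t)}\lambda}{2\alpha_t}\|x_{t+1} - x_t\|_{(i_t)}^2. \]

The right side of the above inequality is a quadratic function of $\|x_{t+1} - x_t\|_{(i_t)}$. By maximizing it,
we obtain

\[ \Delta_2 \le \frac{\alpha_t\big(\|g_t - G_t\|_{(i_t),*} + 2M_{i_t}\big)^2}{2\big(\gamma_{t-1}^{(i_t)}\rho + l_{t-1}^{(i_t)}\lambda\big)}. \]

In view of these bounds on $\Delta_1$ and $\Delta_2$, and the fact that $\omega_{i_t^c}(x_t) = \omega_{i_t^c}(x_{t+1})$, we have

\[ \alpha_t\phi(x_{t+1}) \le \Psi_t(x_{t+1}) - \Psi_{t-1}(x_t) + \alpha_t\big[f(x_t) - F(x_t, \xi_t) + \omega_{i_t^c}(x_t)\big] + \frac{\alpha_t^2\big(\|g_t - G_t\|_{(i_t),*} + 2M_{i_t}\big)^2}{2\big(\gamma_{t-1}^{(i_t)}\rho + l_{t-1}^{(i_t)}\lambda\big)}. \tag{14} \]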


Summing up the above for t = 0,1,... , T — 1, and observing that T_i > 0, di (xq) > 0 (1 < i < n), 
we obtain 

t=o 

T-l 


^n 4 - \ 

t=Q + 


+ '^Oit[f (xt) - F {xt, 6) + Wj- {xt)] . 


(15) 


t=0 


Due to the optimality of xt, for x G A, we have 
T't—i (xt) < 'I't —1 (x) 


T-l 

t=o 

T-l 


/ (xt) + - {gt,x - Xt) + UJi, (x*) 
n 


+ '^lT-idi{x) 


i=l 


1=0 

T-l 

< 

t =0 

T-l 


+ {Gt,x - xt)yy --{gt,x - Xt) 


71 — 1 1 

-/ (xt) + -/ (x) + UJi, (x) 

n n 


+ ^J^T-ldi{x) 


i=l 


+ E 


at 


1=0 


{Gt, X - xt)yy - - {gt,x - Xt) 


(16) 


where the last inequality follows from the convexity of f:{gt,x — Xt) < f (x) — / (xt). Putting (1151) 
and (1161) together yields 

T-l T-l 


(Xt+l) < 


1=0 


n — 1 


n 


1 


4>{xt) + -4>ix 


n 


+ X]at-W*C 


i=l 


t=0 

{\\gt ~ ^t\\{it) F 

+ 2^--r ot, 

0 


2U% + t\^ 


(17) 


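where $\delta_T$ is defined by

\begin{align*}
\delta_T &= \sum_{t=0}^{T-1} \alpha_t\Big[\langle G_t, x - x_t\rangle_{(i_t)} - \frac{1}{n}\langle g_t, x - x_t\rangle + f(x_t) - F(x_t, \xi_t)\Big] \\
&\quad + \sum_{t=0}^{T-1} \alpha_t\Big[\omega_{i_t^c}(x_t) - \frac{n-1}{n}\omega(x_t) + \omega_{i_t}(x) - \frac{1}{n}\omega(x)\Big]. \tag{18}
\end{align*}

In (17), subtracting $\sum_{t=0}^{T-1} \alpha_t\phi(x)$, then $\sum_{t=1}^{T} \frac{n-1}{n}\alpha_t[\phi(x_t) - \phi(x)]$ on both sides, one has

\[ \sum_{t=1}^{T} \Big(\alpha_{t-1} - \frac{n-1}{n}\alpha_t\Big)[\phi(x_t) - \phi(x)] \le \frac{n-1}{n}\alpha_0[\phi(x_0) - \phi(x)] + \sum_{i=1}^{n} \gamma_{T-1}^{(i)} d_i(x) + \sum_{t=0}^{T-1} \frac{\alpha_t^2\big(\|g_t - G_t\|_{(i_t),*} + 2M_{i_t}\big)^2}{2\big(\gamma_{t-1}^{(i_t)}\rho + l_{t-1}^{(i_t)}\lambda\big)} + \delta_T. \tag{19} \]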


Now let us take the expectation on both sides of (1191) . Firstly, taking the expectation with re¬ 
spect to it, for t = 0 , 1 ,..., T-l, we have {Gt, x* - ^ {Gt, x* - xt), and [uq (x*)] = 

w (xt)-Ejj [wit (xt)] = (xt). Moreover, by the assumptions E^^ [F {xt, 6 )] = / ixt), Eg^ [G (x*, ^t)] 

g{xt). Together with the definition (1181) . we see E[6t] = 0. In addition, from the Cauchy- 

Schwarz inequality, we have (||( 7 i — Gt\\qq + < 2 ^\\gt — GtW^-q + , and the expectation 

Es. IliB-Gir;;,, < EsJIG.IIf,, < Mf^. Furthermore, since ^t is independent of qt-i and It-i, we 
have 


E 


{Wat - GtWqq + 2Mi) 

(^t) I /(h) \ 

Tt-iP + liJiX 


21 


< E 


< E 


E. 


g.-G.ii;..) + 4Mi; 

di‘ip+'£b 


imi 


\ I /(^^) \ / 


Ee 

2=1 


lOM.^ 


n (7t-V + ^ 2 i^ 


Using these results, we obtain 


E 

t=i 


n — 1 




n 


atjE[4>{xt)-(j){x)] < ap [<(> (xq) - (/> (x)] + E 

^ 5M2 

+ > - - > E 

^^ n. .‘G—/ 


(d 

7t1i 


di (x) 


n 

2=1 t=0 


2 = 1 


at 


(d , ;(d \ 
7f-iP + 


□ 


In Theorem 4 we presented some general convergence properties of SBDA-u for both stochastic
convex and strongly convex functions. It should be noted that the right side of (10) employs
expectations since both $\{\gamma_t\}$ and $\{l_t\}$ can be random. In the sequel, we describe more specialized
convergence rates for both cases. Let us take $x = x^*$ and use the assumption (3) throughout the
analysis.


Convergence rate when $\omega(x)$ is a simple convex function

Firstly, we consider a constant stepsize policy and assume that $\gamma_t^{(i)}$ depends on $i$ and $T$, where $T$
is the iteration number. More specifically, let $\alpha_t \equiv 1$, and $\gamma_t^{(i)} \equiv \beta_i$ for some $\beta_i > 0$, $1 \le i \le n$,
$-1 \le t \le T$. Then $\mathbb{E}\big[\alpha_t^2/(\gamma_{t-1}^{(i)}\rho)\big] = 1/(\rho\beta_i)$ for $1 \le i \le n$, and hence

\[ \sum_{t=1}^{T} \mathbb{E}[\phi(x_t) - \phi(x^*)] \le (n-1)[\phi(x_0) - \phi(x^*)] + n\sum_{i=1}^{n} \beta_i D_i + T\sum_{i=1}^{n} \frac{5M_i^2}{\rho\beta_i}. \]

Let us choose $\beta_i = \sqrt{5TM_i^2/(n\rho D_i)}$ for $i = 1, 2, \ldots, n$, to optimize the above function. We obtain an upper
bound on the error term:

\[ \sum_{t=1}^{T} \mathbb{E}[\phi(x_t) - \phi(x^*)] \le (n-1)[\phi(x_0) - \phi(x^*)] + 2\sum_{i=1}^{n} \sqrt{\frac{5nTM_i^2D_i}{\rho}}. \]

If we use the average point $\bar x = \sum_{t=1}^{T} x_t/T$ as the output, we obtain the expected optimization error:

\[ \mathbb{E}[\phi(\bar x) - \phi(x^*)] \le \frac{n-1}{T}[\phi(x_0) - \phi(x^*)] + \frac{2\sum_{i=1}^{n} \sqrt{5nM_i^2D_i}}{\sqrt{\rho T}}. \]
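The choice of the constant stepsize can be checked numerically for a single block. The sketch below assumes, as we read the bound above, that the terms depending on $\beta_i$ are $n\beta_i D_i + 5TM_i^2/(\rho\beta_i)$; it then verifies that $\beta_i = \sqrt{5TM_i^2/(n\rho D_i)}$ balances the two terms and attains the value $2\sqrt{5nTM_i^2D_i/\rho}$. All numerical values and function names are illustrative assumptions.

```python
import math


def block_bound(beta, n, T, M, D, rho):
    """beta-dependent part of the bound for one block: n*beta*D + 5*T*M^2/(rho*beta)."""
    return n * beta * D + 5.0 * T * M * M / (rho * beta)


def beta_opt(n, T, M, D, rho):
    """Minimizer of block_bound over beta > 0 (first-order condition)."""
    return math.sqrt(5.0 * T * M * M / (n * rho * D))


n, T, M, D, rho = 10, 1000, 2.0, 3.0, 1.0
b_star = beta_opt(n, T, M, D, rho)
attained = block_bound(b_star, n, T, M, D, rho)
closed_form = 2.0 * math.sqrt(5.0 * n * T * M * M * D / rho)
```

Perturbing `b_star` in either direction increases `block_bound`, which is consistent with the stepsize being chosen per block from $M_i$ and $D_i$ rather than from global constants.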


In addition, we can also choose varying stepsizes without knowing the iteration number
$T$ ahead of time. Differing from traditional stepsize policies where $\gamma_t$ is usually associated with $t$, here $\gamma_t$ is a
random sequence dependent on both $t$ and $i_t$. In order to establish the convergence rate with such
a randomized $\gamma_t$, we first state a useful technical result.

Lemma 5. Let $p$ be a real number with $0 < p < 1$, and let $\{a_s\}$ and $\{b_s\}$ be sequences of nonnegative
numbers satisfying the relation:

\[ a_t = p\,b_t + (1 - p)\,a_{t-1}, \quad t = 1, 2, \ldots \]

Then

\[ \sum_{s=0}^{t} a_s \le \sum_{s=1}^{t} b_s + \frac{a_0}{p}. \]
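A quick numerical sanity check of Lemma 5 is sketched below. It assumes that the conclusion of the lemma reads $\sum_{s=0}^{t} a_s \le \sum_{s=1}^{t} b_s + a_0/p$ (our reading of the statement), generates random nonnegative sequences $\{b_s\}$, builds $\{a_s\}$ from the recursion, and checks the inequality for every prefix. The helper name `lemma5_holds` is hypothetical.

```python
import random


def lemma5_holds(p, a0, b, tol=1e-9):
    """Check sum_{s=0}^t a_s <= sum_{s=1}^t b_s + a0/p for all t, where
    a_t = p*b_t + (1-p)*a_{t-1} and b = [b_1, b_2, ...]."""
    a_prev, sum_a, sum_b = a0, a0, 0.0
    for b_t in b:
        a_t = p * b_t + (1.0 - p) * a_prev
        sum_a += a_t
        sum_b += b_t
        if sum_a > sum_b + a0 / p + tol:
            return False
        a_prev = a_t
    return True


rng = random.Random(1)
all_ok = all(
    lemma5_holds(
        p=rng.uniform(0.05, 0.95),
        a0=rng.uniform(0.0, 5.0),
        b=[rng.uniform(0.0, 3.0) for _ in range(50)],
    )
    for _ in range(200)
)
```

In the analysis that follows, the lemma is applied with $p = 1/n$, which is the probability that a given block is sampled under uniform sampling.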

We first let $\alpha_t \equiv 1$, and define $\{\gamma_t\}$ recursively as

\[ \gamma_t^{(i)} = \begin{cases} u_i\sqrt{t+1}, & i = i_t, \\ \gamma_{t-1}^{(i)}, & i \ne i_t, \end{cases} \]

for some $u_i > 0$, $i = 1, 2, \ldots, n$, $t = 0, 1, 2, \ldots, T$. From this definition, we obtain


E 


(*) 

LtEiJ 


1 1 n-I 

- -p H-E 

n UiVt n 


(d 

L7E2J 


Observing the fact that Er=i ~ 2 \/F+T and applying Lemma[5]with at = E 


and ht = ^-771 we have 

UiVt' 


(i) 

Tt-1 


Ee 

T = 0 


(d 

7 E 1 


t 


1 ^> 1 n 2^^ t T 1 n 
— — / — 7 ^ ^-PT —-1-PT' 
Hence


T-l 

Ee w 

t=o iTt-iP. 


1 

< - 


2 \/r n 

Ui 


i = 1 , 2 ,... n. 


( 20 ) 


With respect to (1201) and Theorem 1 , we obtain 

n T—1 n 


+ 


2 = 1 


5a|M2 
(d „ 


t=o i=i L™7f- 

Choosing m = we have 


[5M2 

2Vt n 

1 

\ np 

Ui 

i 


r//-N ^ / *M / ^ ^ *M ^nMf 2 X)i=i ^WnMfDi 

E [(() (x) )] < [(j) (xo) -(j){x )] + ^ — 7 ^ ' 

i=i Pl-iT 


+ 


T - ^ p^i^lT ' ^pVT 

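We summarize the results in the following corollary:

Corollary 6. In Algorithm 1, let $T > 0$, $\bar x$ be the average point $\bar x = \sum_{t=1}^{T} x_t/T$, and $\alpha_t \equiv 1$.

1. If $\gamma_t^{(i)} \equiv \sqrt{5TM_i^2/(n\rho D_i)}$, for $t = 0, 1, 2, \ldots, T - 1$, $i = 1, 2, \ldots, n$, then

\[ \mathbb{E}[\phi(\bar x) - \phi(x^*)] \le \frac{(n-1)[\phi(x_0) - \phi(x^*)]}{T} + \frac{2\sum_{i=1}^{n} \sqrt{5nM_i^2D_i}}{\sqrt{\rho T}}. \]

2. If $\gamma_t^{(i)} = u_i\sqrt{t+1}$ when $i = i_t$ and $\gamma_t^{(i)} = \gamma_{t-1}^{(i)}$ otherwise, for $t = 0, 1, 2, \ldots, T - 1$, and $u_i = \sqrt{10M_i^2/(n\rho D_i)}$, $i = 1, 2, \ldots, n$,
then

\[ \mathbb{E}[\phi(\bar x) - \phi(x^*)] \le \frac{n-1}{T}[\phi(x_0) - \phi(x^*)] + \sum_{i=1}^{n} \frac{5nM_i^2}{\rho\gamma_{-1}^{(i)}T} + \frac{2\sum_{i=1}^{n} \sqrt{10nM_i^2D_i}}{\sqrt{\rho T}}. \]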




Corollary 6 provides both constant and adaptive stepsizes, and SBDA-u obtains a rate of convergence
of $O(1/\sqrt{T})$ for both, which matches the optimal rate for nonsmooth stochastic approximation
[please see (2.48) in | 2 T]]. In the context of nonsmooth deterministic problems, it also matches the
convergence rate of the subgradient method. However, it is more interesting to compare this with
the convergence rate of BCD methods [please see, for example, Corollary 2.2 part b) in [^j]. Ignoring
the remaining terms, the corresponding rate is $O\Big(\sqrt{n}\sqrt{\sum_{i=1}^{n} M_i^2}\sqrt{\sum_{i=1}^{n} D_i}\big/\sqrt{T}\Big)$. Although the rate of
$O(1/\sqrt{T})$ is unimprovable, it can be seen (using the Cauchy-Schwarz inequality) that

\[ \sum_{i=1}^{n} \sqrt{M_i^2 D_i} \le \sqrt{\sum_{i=1}^{n} M_i^2}\,\sqrt{\sum_{i=1}^{n} D_i}, \]

with the equality holding if and only if the ratio $M_i^2/D_i$ is equal to some positive constant, $1 \le i \le n$.
However, if this ratio is very different in each coordinate block, SBDA-u is able to obtain a much
tighter bound. To see this point, consider the sequences $\{M_i\}$ and $\{D_i\}$ such that $k$ items in $\{M_i\}$
are $O(M)$ for some integer $k$, $0 < k \ll n$, while the rest are $o(1/n)$, and $D_i$ is uniformly bounded
by $D$, $1 \le i \le n$. Then the constant in SBDA-u is $O(\sqrt{n}\,kM\sqrt{D})$ while the one in SBMD is
$O(n\sqrt{k}\,M\sqrt{D})$, which is $\sqrt{n/k}$ times larger.
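The comparison of constants in the example above can be reproduced numerically. The sketch below evaluates, up to common factors, the SBDA-u constant $\sqrt{n}\sum_i M_i\sqrt{D_i}$ and the SBMD constant $\sqrt{n}\sqrt{\sum_i M_i^2}\sqrt{\sum_i D_i}$ as we read them from the discussion, for the extreme case where $k$ blocks have $M_i = M$ and the remaining blocks have $M_i = 0$ (an idealization of "$o(1/n)$"). The function names and numbers are illustrative assumptions.

```python
import math


def sbda_u_const(M, D):
    """sqrt(n) * sum_i M_i * sqrt(D_i), the block-adaptive constant."""
    n = len(M)
    return math.sqrt(n) * sum(m * math.sqrt(d) for m, d in zip(M, D))


def sbmd_const(M, D):
    """sqrt(n) * sqrt(sum_i M_i^2) * sqrt(sum_i D_i), the non-adaptive constant."""
    n = len(M)
    return math.sqrt(n) * math.sqrt(sum(m * m for m in M)) * math.sqrt(sum(D))


n, k = 100, 4
M = [1.0] * k + [0.0] * (n - k)   # only k blocks carry large subgradients
D = [1.0] * n                     # D_i uniformly bounded
ratio = sbmd_const(M, D) / sbda_u_const(M, D)
```

With these values the ratio equals $\sqrt{n/k} = 5$; when $M_i^2/D_i$ is the same for every block the two constants coincide, as the Cauchy-Schwarz equality case predicts.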
Convergence rate when $\omega(x)$ is strongly convex

In this section, we investigate the convergence of SBDA-u when $\omega(x)$ is strongly convex with modulus
$\lambda$, $\lambda > 0$. More specifically, we consider two averaging schemes and stepsize selections. In the
first approach, we apply a simple averaging scheme similar to [36]. By setting $\alpha_t \equiv 1$, all the
past stochastic block subgradients are weighted equally. In the second approach, we apply a more
aggressive weighting scheme, which puts more weight on the later iterates.

To prove the convergence of SBDA-u when $\omega(x)$ is strongly convex, we introduce in the following
lemma a useful “coupling” property for Bernoulli random variables:


Lemma 7. Let $r_1, r_2, r_3$ be i.i.d. samples from $\mathrm{Bernoulli}(p)$, $0 < p < 1$, $a, b > 0$, and any $x$ such
that $0 \le x \le a$, then

\[ \mathbb{E}\bigg[\frac{1}{r_1 x + r_2(a - x) + b}\bigg] \le \mathbb{E}\bigg[\frac{1}{r_3 a + b}\bigg]. \tag{21} \]
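Since only three Bernoulli variables are involved, the coupling inequality (21) can be checked by exact enumeration. The sketch below computes both expectations in closed form over a small grid of parameters; the grid values and function names are our own illustrative choices, and the check is a sanity test rather than a proof.

```python
def lhs_expectation(p, a, b, x):
    """E[1 / (r1*x + r2*(a - x) + b)] for independent r1, r2 ~ Bernoulli(p)."""
    total = 0.0
    for r1 in (0, 1):
        for r2 in (0, 1):
            prob = (p if r1 else 1.0 - p) * (p if r2 else 1.0 - p)
            total += prob / (r1 * x + r2 * (a - x) + b)
    return total


def rhs_expectation(p, a, b):
    """E[1 / (r3*a + b)] for r3 ~ Bernoulli(p)."""
    return p / (a + b) + (1.0 - p) / b


worst_gap = min(
    rhs_expectation(p, a, b) - lhs_expectation(p, a, b, a * frac)
    for p in (0.1, 0.5, 0.9)
    for a in (0.5, 1.0, 4.0)
    for b in (0.1, 1.0, 3.0)
    for frac in (0.0, 0.25, 0.5, 1.0)
)
```

The gap vanishes at the endpoints $x = 0$ and $x = a$ and is positive in between, reflecting the convexity of $s \mapsto 1/(s + b)$.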


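In the next corollary, we derive the specific convergence rates for strongly convex problems.

Corollary 8. In Algorithm 1, if $\omega(x)$ is $\lambda$-strongly convex with modulus $\lambda > 0$, then

1. if $\alpha_t \equiv 1$, $\gamma_t^{(i)} \equiv \lambda/\rho$, for $t = 0, 1, 2, \ldots, T - 1$, and $\bar x = \sum_{t=1}^{T} x_t/T$, then

\[ \mathbb{E}[\phi(\bar x) - \phi(x^*)] \le \frac{(n-1)[\phi(x_0) - \phi(x^*)] + (n\lambda/\rho)\sum_{i=1}^{n} d_i(x^*)}{T} + \frac{5n\big(\sum_{i=1}^{n} M_i^2\big)\log(T+1)}{\lambda T}; \]

2. if $\alpha_t = n + t$, for $t = 0, 1, 2, \ldots$, and $\alpha_{-1} = 0$, $\gamma_t^{(i)} \equiv \lambda(2n + T)/\rho$, for $t = 0, 1, 2, \ldots, T - 1$,
then

\[ \mathbb{E}[\phi(\bar x) - \phi(x^*)] \le \frac{2n(n-1)[\phi(x_0) - \phi(x^*)] + 2n(2n+T)(\lambda/\rho)\sum_{i=1}^{n} d_i(x^*)}{T(T+1)} + \frac{10n\big(\sum_{i=1}^{n} M_i^2\big)}{\lambda(T+1)}\bigg[1 + \frac{n + (n+1)\log T}{T}\bigg]. \]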


Proof. In part 1), let at ^ 1, 7 !* = A/p, it can be observed that l^\ ~ Binomial (t, t > 0, 

we have 


E 


1 


i{i') \ I (^) 

lit-i^ + ii-iPj 




i=0 


n 


t—i 


1 


A (i -|-1) 


n 


E 


t + l\ /iV+Vn-l 


A (t -|- 1 ) ^ \i + 1J \n J 

i=0 


n 


n 


< 


A (t -|-1) 
n 


1 - 


n 


— ) 


t+i 


X(t -\- \) 

Observing the fact that Y1t=o (^ + 2), we obtain 

(,,.)] < (n-l)l4,{xo)-4>{x>)] + \H/pY:U<‘iir) , 5n(E?.iA7)log(r+l) 

T XT 
In part 2), let $\alpha_t = n + t$, for $t = 0, 1, 2, \ldots$ and $\alpha_{-1} = 0$, $\gamma_t^{(i)} \equiv \lambda(2n + T)/\rho$. Then for $t \ge 0$
and any fixed $i$, let $r_s = \mathbb{1}\{i_s = i\}$, hence $r_s \sim \mathrm{Bernoulli}(p)$. In addition, we assume a sequence of ghost
i.i.d. samples $\{r'_s\}$. For $t > 0$,


E 


/(d \ I (d 
Ut A + 7t P. 


= E 


< E 


< E 


< 


.^J2l=0 (n + s) + 7fV_ 


^ (re + s) + rt-s (re + t - s)} + A ( 2 re + T) 

1 


A(2re + t) + 


re 


A ( 2 re + t) (max{|'t/ 2 ], 1 }) 


( 22 ) 


where the second inequality follows from the independence of {rg} and {r(} and the coupling property 
in Lemma [71 It can be seen that the conclusion in (|22l) holds when t = — 1, 0 as well. Hence 


a? 


T-l 

^ (*) I -v 

4=0 llt-lP + ^4-1^ 


< 


< 


(n + tf 


T-l 

^ A (2re + t - 1) (max {((t - 1) / 2 ], 1 }) 


T-l 

E 

4=0 


(re + 1 ) 


A(max{[(t- 1)/2],1}) 


2 re + 1 (re + t) 


A 


< 


2re + 1 2 (re + t) 

+ E 


A 


4=2 


A(t-l) 


T-l 


2re + 2T — 1 2 (re + 1) 

+ E 


A 


4E^ 


< 


2re + 2r 2 (re + 1) 1 , 

~t“-^- / —dx 

Ji X 


A 


A 


< - (re + T + (re + 1 ) log T). 

A 


txt 

Let X = be the weighted average point, then 


eXh 


E[,^(x)-0(x*)] < 


2re (re - 1) [(/>(xo) - (/>(x*)] + 2re (2re + T) X/pJ2i A 


+ 


lore (Er=i ) 

A(r + i) 


T(r +1) 


1 + 


re + (re + 1 ) logT 


□ 


For nonsmooth and strongly convex objectives, we presented two options to select $\{\alpha_t\}$ and $\{\gamma_t\}$.
These results seem to provide new insights on the dual averaging approach as well. To see this,
we consider SBDA-u when $n = 1$. In the first scheme, when $\alpha_t \equiv 1$, the convergence rate of
$O(\log T/T)$ is similar to the one in [36]. The second scheme of Corollary 8 shows that regularized
dual averaging methods can be easily improved to be optimal when equipped with a more
aggressive averaging scheme. Our observation suggests an alternative with rate $O(1/T)$ to the more
complicated accelerated scheme ([IE]). Such results seem new to the world of simple averaging
methods, and are on par with the recent discoveries for stochastic mirror descent methods (ism El El
ESI El]).
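The "more aggressive" averaging of the second scheme can be made explicit. Assuming, as we read Algorithm 1 and the left side of (10), that the output weights $x_t$ by $\alpha_{t-1} - \frac{n-1}{n}\alpha_t$, the choice $\alpha_t = n + t$ gives a weight of exactly $t/n$, so later iterates count linearly more. The sketch below verifies this identity with exact rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction


def alpha(t, n):
    """Weighting sequence of the second scheme: alpha_t = n + t."""
    return n + t


def output_weight(t, n):
    """Weight of x_t in the output: alpha_{t-1} - ((n - 1) / n) * alpha_t."""
    return alpha(t - 1, n) - Fraction(n - 1, n) * alpha(t, n)


n = 5
weights = [output_weight(t, n) for t in range(1, 11)]
expected = [Fraction(t, n) for t in range(1, 11)]
```

With $\alpha_t \equiv 1$ the same formula gives the constant weight $1/n$, i.e. the plain average used in the first scheme.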


4 Nonuniformly randomized SBDA (SBDA-r) 

In this section we consider the general nonsmooth convex problem when $\omega(x) = 0$ or $\omega(x)$ is lumped
into $f(\cdot)$:

\[ \min_{x \in X} \phi(x) = f(x), \]

and show a variant of SBDA in which block coordinates are sampled non-uniformly. More specifically,
we assume the block coordinates are i.i.d. sampled from a discrete distribution $\{p_i\}_{1 \le i \le n}$, $0 < p_i < 1$,
$1 \le i \le n$. We describe in Algorithm 2 the nonuniformly randomized stochastic block dual averaging
method (SBDA-r).

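Input: convex function $f$, sequence of samples $\{\xi_t\}$, distribution $\{p_i\}_{1 \le i \le n}$;
initialize $\alpha_0 \in \mathbb{R}$, $\gamma_{-1} \in \mathbb{R}^n$, $\bar G = 0^N$ and $x_0 = \arg\min_{x \in X} \sum_{i=1}^{n} \frac{\gamma_{-1}^{(i)}}{p_i} d_i(x^{(i)})$;
for $t = 0, 1, \ldots, T - 1$ do
    sample a block $i_t \in \{1, 2, \ldots, n\}$ with probability $\mathrm{Prob}(i_t = i) = p_i$;
    set $\gamma_t^{(i)}$, $i = 1, 2, \ldots, n$;
    receive sample $\xi_t$ and update $\bar G$: $\bar G = \bar G + \frac{\alpha_t}{p_{i_t}} U_{i_t} G^{(i_t)}(x_t, \xi_t)$;
    update $x_{t+1}^{(i_t)} = \arg\min_{x \in X_{i_t}} \Big\{ \langle \bar G^{(i_t)}, x\rangle + \frac{\gamma_t^{(i_t)}}{p_{i_t}} d_{i_t}(x) \Big\}$;
    set $x_{t+1}^{(j)} = x_t^{(j)}$, $j \ne i_t$;
end
Output: $\bar x = \big(\sum_{t=0}^{T} \alpha_t x_t\big) \big/ \big(\sum_{t=0}^{T} \alpha_t\big)$;

Algorithm 2: Nonuniformly randomized stochastic block dual averaging (SBDA-r) method.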


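In the next theorem, we present the main convergence property of SBDA-r, which expresses the
bound of the expected optimization error as a joint function of the sampling distribution $\{p_i\}$ and
the sequences $\{\alpha_t\}$, $\{\gamma_t\}$.

Theorem 9. In Algorithm 2, let $\{x_t\}$ be the generated solutions and $x^*$ be the optimal solution, $\{\alpha_t\}$
be a sequence of positive numbers, and $\{\gamma_t\}$ be a sequence of vectors satisfying the assumption (9). Let
$\bar x = \sum_{t=0}^{T} \alpha_t x_t \big/ \sum_{t=0}^{T} \alpha_t$ be the average point, then

\[ \mathbb{E}[f(\bar x) - f(x^*)] \le \frac{1}{\sum_{t=0}^{T} \alpha_t}\Bigg[\sum_{t=0}^{T}\sum_{i=1}^{n} \mathbb{E}\bigg[\frac{\alpha_t^2\|G_t^{(i)}\|_{(i),*}^2}{2\rho\gamma_{t-1}^{(i)}}\bigg] + \sum_{i=1}^{n} \frac{\mathbb{E}\big[\gamma_T^{(i)}\big]}{p_i} d_i(x^*)\Bigg]. \tag{23} \]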


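Proof. For the sake of simplicity, let us denote $A_t = \sum_{\tau=0}^{t} \alpha_\tau$, $t = 0, 1, 2, \ldots$. Based on the
convexity of $f$, we have $f(\bar{x}) \le \frac{1}{A_T} \sum_{t=0}^{T} \alpha_t f(x_t)$ and $f(x_t) \le f(x) + \langle g_t, x_t - x \rangle$ for $x \in X$.

Then

$$A_T [f(\bar{x}) - f(x)] \le \sum_{t=0}^{T} \alpha_t \langle g_t, x_t - x \rangle = \underbrace{\sum_{t=0}^{T} \frac{\alpha_t}{p_{i_t}} \langle U_{i_t} G_t^{(i_t)}, x_t - x \rangle}_{\Delta_1} + \underbrace{\sum_{t=0}^{T} \alpha_t \left\langle g_t - \frac{1}{p_{i_t}} U_{i_t} G_t^{(i_t)}, x_t - x \right\rangle}_{\Delta_2}.$$

It suffices to provide precise bounds on the expectation of $\Delta_1$, $\Delta_2$ separately.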
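We define the auxiliary function

$$\Psi_t(x) = \begin{cases} \sum_{s=0}^{t} \frac{\alpha_s}{p_{i_s}} \langle U_{i_s} G_s^{(i_s)}, x \rangle + \sum_{i=1}^{n} \frac{\gamma_t^{(i)}}{p_i} d_i(x), & t \ge 0, \\ \sum_{i=1}^{n} \frac{\gamma_{-1}^{(i)}}{p_i} d_i(x), & t = -1. \end{cases} \tag{24}$$

Thus

$$\begin{aligned} \Psi_t(x_{t+1}) = \min_{x} \Psi_t(x) &\ge \min_{x} \left\{ \sum_{s=0}^{t} \frac{\alpha_s}{p_{i_s}} \langle U_{i_s} G_s^{(i_s)}, x \rangle + \sum_{i=1}^{n} \frac{\gamma_{t-1}^{(i)}}{p_i} d_i(x) \right\} \\ &= \min_{x} \left\{ \frac{\alpha_t}{p_{i_t}} \langle U_{i_t} G_t^{(i_t)}, x \rangle + \Psi_{t-1}(x) \right\}. \end{aligned} \tag{25}$$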


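The first inequality follows from the property (9). Next, using (25) and Lemma 2, we obtain

$$\frac{\alpha_t}{p_{i_t}} \langle U_{i_t} G_t^{(i_t)}, x_t \rangle \le \Psi_t(x_{t+1}) - \Psi_{t-1}(x_t) + \frac{\alpha_t^2}{2\rho p_{i_t} \gamma_{t-1}^{(i_t)}} \|G_t^{(i_t)}\|_{(i_t),*}^2.$$

Summing up the above inequality for $t = 0, \ldots, T$, we have

$$\sum_{t=0}^{T} \frac{\alpha_t}{p_{i_t}} \langle U_{i_t} G_t^{(i_t)}, x_t \rangle \le \Psi_T(x_{T+1}) - \Psi_{-1}(x_0) + \sum_{t=0}^{T} \frac{\alpha_t^2}{2\rho p_{i_t} \gamma_{t-1}^{(i_t)}} \|G_t^{(i_t)}\|_{(i_t),*}^2. \tag{26}$$

Moreover, by the optimality of $x_{T+1}$ in solving $\min_{x \in X} \Psi_T(x)$, for all $x \in X$, we have

$$\Psi_T(x_{T+1}) \le \sum_{t=0}^{T} \frac{\alpha_t}{p_{i_t}} \langle U_{i_t} G_t^{(i_t)}, x \rangle + \sum_{i=1}^{n} \frac{\gamma_T^{(i)}}{p_i} d_i(x). \tag{27}$$

Putting (26) and (27) together, and using the fact that $\Psi_{-1}(x_0) \ge 0$, we obtain:

$$\Delta_1 \le \sum_{i=1}^{n} \frac{\gamma_T^{(i)}}{p_i} d_i(x) + \sum_{t=0}^{T} \frac{\alpha_t^2}{2\rho p_{i_t} \gamma_{t-1}^{(i_t)}} \|G_t^{(i_t)}\|_{(i_t),*}^2.$$

For each $t$, taking expectation w.r.t. $i_t$, we have

$$\mathbb{E}\left[ \frac{\alpha_t^2 \|G_t^{(i_t)}\|_{(i_t),*}^2}{2\rho p_{i_t} \gamma_{t-1}^{(i_t)}} \right] = \mathbb{E}\left[ \mathbb{E}_{i_t}\left[ \frac{\alpha_t^2 \|G_t^{(i_t)}\|_{(i_t),*}^2}{2\rho p_{i_t} \gamma_{t-1}^{(i_t)}} \right] \right] = \sum_{i=1}^{n} \mathbb{E}\left[ \frac{\alpha_t^2 \|G_t^{(i)}\|_{(i),*}^2}{2\rho \gamma_{t-1}^{(i)}} \right].$$

As a consequence, one has

$$\mathbb{E}[\Delta_1] \le \sum_{i=1}^{n} \frac{\mathbb{E}[\gamma_T^{(i)}]}{p_i} d_i(x) + \sum_{t=0}^{T} \sum_{i=1}^{n} \mathbb{E}\left[ \frac{\alpha_t^2 \|G_t^{(i)}\|_{(i),*}^2}{2\rho \gamma_{t-1}^{(i)}} \right]. \tag{28}$$

In addition, taking the expectation with respect to $i_t$ and $\xi_t$, and noting that
$\mathbb{E}_{i_t}\left[ \frac{1}{p_{i_t}} U_{i_t} G_t^{(i_t)} \right] = G_t$ and $\mathbb{E}_{\xi_t}[G_t] - g_t = 0$, we obtain

$$\mathbb{E}[\Delta_2] = 0. \tag{29}$$

In view of (28) and (29), we obtain the bound on the expected optimization error:

$$\mathbb{E}[f(\bar{x}) - f(x)] \le \frac{1}{\sum_{t=0}^{T} \alpha_t} \left[ \sum_{t=0}^{T} \sum_{i=1}^{n} \mathbb{E}\left[ \frac{\alpha_t^2 \|G_t^{(i)}\|_{(i),*}^2}{2\rho \gamma_{t-1}^{(i)}} \right] + \sum_{i=1}^{n} \frac{\mathbb{E}[\gamma_T^{(i)}]}{p_i} d_i(x) \right].$$

□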
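The step that makes the second error term vanish in expectation is the unbiasedness of the importance-weighted block subgradient: averaging $\frac{1}{p_{i}} U_{i} G^{(i)}$ over the sampling distribution returns the full vector $G$. The short check below computes that expectation exactly by enumerating the blocks. The partition, the probabilities and the vector `G` are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
blocks = [np.arange(0, 2), np.arange(2, 5), np.arange(5, 6)]  # hypothetical partition of R^6
p = np.array([0.5, 0.3, 0.2])                                 # sampling probabilities p_i
G = rng.standard_normal(6)      # stands in for a full stochastic subgradient G_t

def block_estimate(G, i):
    """Return (1/p_i) * U_i G^{(i)}: block i rescaled by 1/p_i, all other blocks zero."""
    est = np.zeros_like(G)
    est[blocks[i]] = G[blocks[i]] / p[i]
    return est

# Exact expectation over i_t: sum_i p_i * (1/p_i) U_i G^{(i)} equals G.
expectation = sum(p[i] * block_estimate(G, i) for i in range(len(blocks)))
print(np.allclose(expectation, G))
```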


Block Coordinates Sampling and Analysis 


In view of Theorem 9, the obtained upper bound can be conceived as a joint function of the probability
mass $\{p_i\}$ and the control sequences $\{\alpha_t\}$, $\{\gamma_t\}$. Firstly, throughout this section, let $x = x^*$ and
assume

$$\alpha_t = 1, \quad t = 0, 1, 2, \ldots. \tag{30}$$

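Naturally, we can choose the distribution and stepsizes by optimizing the bound

$$\min_{\{\gamma_t\}, p} \mathcal{L}(\{\gamma_t\}, p) = \sum_{t=0}^{T} \sum_{i=1}^{n} \mathbb{E}\left[ \frac{M_i^2}{2\rho \gamma_{t-1}^{(i)}} \right] + \sum_{i=1}^{n} \frac{\mathbb{E}[\gamma_T^{(i)}] D_i}{p_i}. \tag{31}$$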


This is a joint problem on two groups of variables. Let us first discuss how to choose $\{\gamma_t\}$ for any
fixed $p$. Let us assume $p_i$ has the form $p_i = \frac{M_i^a D_i^b}{C_{a,b}}$, $i = 1, 2, \ldots, n$, where $a, b \ge 0$, and define
$C_{a,b} = \sum_{i=1}^{n} M_i^a D_i^b$. We derive two stepsize rules, depending on whether the iteration number $T$
is known or not. We assume $\gamma_t^{(i)} = \beta_i$ for some constant $\beta_i$, $i = 1, 2, \ldots, n$, $t = 1, 2, \ldots, T$. The
equivalent problem with $p$, $\beta$ has the form

$$\min_{\beta} \mathcal{L}(p, \beta) = \sum_{i=1}^{n} \left[ \frac{(T+1) M_i^2}{2\rho \beta_i} + \frac{\beta_i D_i}{p_i} \right]. \tag{32}$$

By optimizing w.r.t. $\beta$, we obtain the optimal solutions

$$\gamma_t^{(i)} = \beta_i = \sqrt{\frac{(1+T) p_i M_i^2}{2\rho D_i}}. \tag{33}$$
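As a quick numerical sanity check of the constant stepsize rule, the sketch below evaluates a bound of the form $\sum_i \left[ \frac{(T+1) M_i^2}{2\rho \beta_i} + \frac{\beta_i D_i}{p_i} \right]$ at $\beta_i = \sqrt{(1+T) p_i M_i^2 / (2\rho D_i)}$ and compares it with perturbed stepsizes. The values of `M`, `D`, `p` and `T`, and the helper names, are hypothetical; $\rho = 1$ is assumed for the closed-form value.

```python
import numpy as np

def constant_stepsizes(M, D, p, T, rho=1.0):
    """beta_i = sqrt((T + 1) * p_i * M_i^2 / (2 * rho * D_i))."""
    return np.sqrt((T + 1) * p * M**2 / (2.0 * rho * D))

def bound(beta, M, D, p, T, rho=1.0):
    """sum_i [(T + 1) M_i^2 / (2 rho beta_i) + beta_i D_i / p_i]."""
    return np.sum((T + 1) * M**2 / (2.0 * rho * beta) + beta * D / p)

M = np.array([1.0, 2.0, 0.5])   # hypothetical bounds on block subgradient norms
D = np.array([1.0, 0.5, 4.0])   # hypothetical bounds D_i on d_i(x*)
p = np.array([0.2, 0.5, 0.3])   # any fixed sampling distribution
T = 999

beta = constant_stepsizes(M, D, p, T)
# Plugging beta back in gives sqrt(2 (T + 1) / rho) * sum_i M_i sqrt(D_i / p_i); here rho = 1.
closed_form = np.sqrt(2.0 * (T + 1)) * np.sum(M * np.sqrt(D / p))
print(bound(beta, M, D, p, T), closed_form)
```

The closed-form value depends on $p$ only through $\sum_i M_i \sqrt{D_i / p_i}$, which is the quantity the sampling distribution is later chosen to minimize.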
In addition, we can also select stepsizes without assuming the iteration number $T$. Let us denote

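$$\gamma_t^{(i)} = \begin{cases} \sqrt{t+1}\, u_i & \text{if } i = i_t, \\ \gamma_{t-1}^{(i)} & \text{otherwise}, \end{cases} \tag{34}$$

for some unspecified $u_i$, $1 \le i \le n$. Applying Lemma 5 with $a_t = \mathbb{E}\left[ \frac{1}{\gamma_{t-1}^{(i)}} \right]$ and $b_t = \frac{1}{\sqrt{t}\, u_i}$,
we have

$$\sum_{t=0}^{T} \mathbb{E}\left[ \frac{1}{\gamma_{t-1}^{(i)}} \right] \le \sum_{t=1}^{T} \frac{1}{\sqrt{t}\, u_i} + \frac{1}{\gamma_{-1}^{(i)} p_i}.$$

In view of the above analysis, we can relax the problem to the following:

$$\min_{p, u} \sum_{i=1}^{n} \left[ \frac{M_i^2 \sqrt{T+1}}{\rho u_i} + \frac{u_i \sqrt{T+1}\, D_i}{p_i} + \frac{M_i^2}{2\rho \gamma_{-1}^{(i)} p_i} \right].$$

Note that the third term above is $o(\sqrt{T})$ and hence can be ignored for the sake of simplicity. Thus
we have the approximate problem

$$\min_{p, u} \sum_{i=1}^{n} \left[ \frac{M_i^2 \sqrt{T+1}}{\rho u_i} + \frac{u_i \sqrt{T+1}\, D_i}{p_i} \right]. \tag{35}$$

We apply a similar analysis and obtain $u_i = \sqrt{\frac{p_i M_i^2}{\rho D_i}}$, and hence the second stepsize rule

$$\gamma_t^{(i)} = \begin{cases} \sqrt{\frac{(t+1) p_i M_i^2}{\rho D_i}} & \text{if } i = i_t, \\ \gamma_{t-1}^{(i)} & \text{otherwise}, \end{cases} \qquad t \ge 0. \tag{36}$$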

We have established the relation between the optimized sampling probability and stepsizes. Now
we are ready to discuss specific choices of the probability distribution. Firstly, the simplest way is
to set

$$p_i = \frac{1}{n}, \quad i = 1, 2, \ldots, n, \tag{37}$$

which implies that SBDA-r reduces to the uniform sampling method SBDA-u with the obtained
stepsizes entirely similar to the ones we derived earlier. However, from the above analysis, it is
possible to choose the sampling distribution properly and obtain a further improved convergence
rate. Next we show how to obtain the optimal sampling and stepsize policies from solving the joint
problem (31). We first describe an important property in the following lemma.


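Lemma 10. Let $S^n$ be the $n$-dimensional simplex. The optimal solution $x^*, y^*$ of the nonlinear
problem $\min_{x \in \mathbb{R}^n_{++},\, y \in S^n} \sum_{i=1}^{n} \left[ \frac{a_i}{x_i} + \frac{b_i x_i}{y_i} \right]$, where $a_i, b_i > 0$, $1 \le i \le n$, is

$$y_i^* = (a_i b_i)^{1/3} W, \quad \text{and} \quad x_i^* = a_i^{2/3} b_i^{-1/3} \sqrt{W},$$

where $i = 1, 2, \ldots, n$ and $W = \left( \sum_{i=1}^{n} (a_i b_i)^{1/3} \right)^{-1}$.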


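Applying Lemma 10 to the problem $\min_{p, \beta} \mathcal{L}(p, \beta)$, we obtain the optimal sampling probability

$$p_i = \frac{M_i^{2/3} D_i^{1/3}}{C}, \quad i = 1, 2, \ldots, n, \tag{38}$$

where $C$ is the normalizing constant. This is also the optimal probability in problem (35). In view
of these results, we obtain the specific convergence rates in the following corollary:

Corollary 11. In Algorithm 2, let $\alpha_t = 1$, $t \ge 0$. Denote $C = \sum_{i=1}^{n} M_i^{2/3} D_i^{1/3}$, with block
coordinates sampled from distribution (38). Then:

1. if $\gamma_t^{(i)} = \sqrt{\frac{T+1}{2\rho C}} \frac{M_i^{4/3}}{D_i^{1/3}}$, $t \ge -1$, $i = 1, 2, \ldots, n$, then

$$\mathbb{E}[f(\bar{x}) - f(x^*)] \le \frac{\sqrt{2}\, C^{3/2}}{\sqrt{\rho} \sqrt{T+1}}; \tag{39}$$

2. if $\gamma_{-1}^{(i)} = \sqrt{\frac{1}{\rho C}} \frac{M_i^{4/3}}{D_i^{1/3}}$ and $\gamma_t^{(i)} = \begin{cases} \sqrt{\frac{t+1}{\rho C}} \frac{M_i^{4/3}}{D_i^{1/3}} & \text{if } i = i_t, \\ \gamma_{t-1}^{(i)} & \text{o.w.}, \end{cases}$ $t \ge 0$, $i = 1, 2, \ldots, n$, then

$$\mathbb{E}[f(\bar{x}) - f(x^*)] \le \left[ \frac{2}{\sqrt{T+1}} + \frac{n}{2(T+1)} \right] \frac{C^{3/2}}{\sqrt{\rho}}. \tag{40}$$

Proof. It remains to plug the value of $\{\gamma_t\}$, $p$ back into $\mathcal{L}(\cdot, \cdot)$.

□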
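The optimal sampling distribution and the constant $C^{3/2}$ can be checked numerically. The sketch below assumes the $p$-dependent factor of the optimized bound is $\sum_i M_i \sqrt{D_i / p_i}$ (the value obtained after optimizing the constant stepsizes), computes $p_i \propto M_i^{2/3} D_i^{1/3}$, and compares it with random points of the simplex. The function names and the values of `M` and `D` are hypothetical.

```python
import numpy as np

def optimal_sampling(M, D):
    """Return p_i = M_i^(2/3) D_i^(1/3) / C together with C = sum_i M_i^(2/3) D_i^(1/3)."""
    w = M ** (2.0 / 3.0) * D ** (1.0 / 3.0)
    return w / w.sum(), w.sum()

def rate_constant(p, M, D):
    """sum_i M_i sqrt(D_i / p_i): the p-dependent factor of the optimized bound."""
    return np.sum(M * np.sqrt(D / p))

M = np.array([1.0, 2.0, 0.5, 3.0])   # hypothetical block subgradient bounds
D = np.array([1.0, 0.5, 4.0, 0.1])   # hypothetical block distance bounds
p_opt, C = optimal_sampling(M, D)

rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(4), size=1000)   # random distributions on 4 blocks
best_random = min(rate_constant(q, M, D) for q in candidates)
print(rate_constant(p_opt, M, D), C ** 1.5, best_random)
```

At the optimal distribution the factor equals $C^{3/2}$, which is where the constant in the corollary comes from; none of the random candidates should do better.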


It is interesting to compare the convergence properties of SBDA-r with those of SBDA-u and SBMD.
SBDA with uniform sampling of block coordinates only yields suboptimal dependence on the
multiplicative constants. Nevertheless, the rate can be further improved by employing optimal
nonuniform sampling. To develop further intuition, we relate the two rates of convergence with the help of
Hölder's inequality:

$$\left[ \sum_{i=1}^{n} M_i^{2/3} D_i^{1/3} \right]^{3/2} \le \left[ \left( \sum_{i=1}^{n} \left( M_i^{2/3} D_i^{1/3} \right)^{3/2} \right)^{2/3} \left( \sum_{i=1}^{n} 1 \right)^{1/3} \right]^{3/2} = \sum_{i=1}^{n} \left( M_i \sqrt{D_i} \right) \cdot \sqrt{n}.$$

The inequality is tight if and only if for some constant $c > 0$ and all $i$, $1 \le i \le n$: $M_i \sqrt{D_i} = c$. In
addition, we compare SBDA-r with a nonuniform version of SBMD¹, which obtains
$O\left( \sqrt{\sum_{i=1}^{n} M_i^2} \cdot \sum_{i=1}^{n} \sqrt{D_i} \,/\, \sqrt{T} \right)$, assuming blocks are sampled based on the distribution
$p_i \propto \sqrt{D_i}$. Again, applying Hölder's inequality, we have

$$\left[ \sum_{i=1}^{n} M_i^{2/3} D_i^{1/3} \right]^{3/2} \le \left[ \left( \sum_{i=1}^{n} M_i^2 \right)^{1/3} \left( \sum_{i=1}^{n} \sqrt{D_i} \right)^{2/3} \right]^{3/2} = \sqrt{\sum_{i=1}^{n} M_i^2} \cdot \sum_{i=1}^{n} \sqrt{D_i}.$$

In conclusion, SBDA-r, equipped with an optimized block sampling scheme, obtains the best
iteration complexity among all the block subgradient methods.
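The comparison can also be illustrated numerically. The sketch below assumes that, up to common factors, the leading constants are $\left[\sum_i M_i^{2/3} D_i^{1/3}\right]^{3/2}$ for SBDA-r, $\sqrt{n} \sum_i M_i \sqrt{D_i}$ for SBDA-u, and $\sqrt{\sum_i M_i^2} \cdot \sum_i \sqrt{D_i}$ for the nonuniform SBMD variant. The function name and the random test values are hypothetical.

```python
import numpy as np

def rate_constants(M, D):
    """Leading constants (up to shared factors) of the three rates being compared."""
    n = len(M)
    sbda_r = np.sum(M ** (2.0 / 3.0) * D ** (1.0 / 3.0)) ** 1.5
    sbda_u = np.sqrt(n) * np.sum(M * np.sqrt(D))
    sbmd_nonuniform = np.sqrt(np.sum(M ** 2)) * np.sum(np.sqrt(D))
    return sbda_r, sbda_u, sbmd_nonuniform

rng = np.random.default_rng(0)
M = rng.uniform(0.1, 5.0, size=8)   # hypothetical block subgradient bounds
D = rng.uniform(0.1, 5.0, size=8)   # hypothetical block distance bounds
print(rate_constants(M, D))

# Equality with the SBDA-u constant when M_i * sqrt(D_i) is the same for every block.
print(rate_constants(2.0 / np.sqrt(D), D)[:2])
```

The gap between the constants grows as the products $M_i \sqrt{D_i}$ become more uneven across blocks, which is the regime where nonuniform sampling is expected to help.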


5 Experiments 

In this section, we examine the theoretical advantages of SBDA through several preliminary
experiments. For all the algorithms compared, we estimate the parameters and tune the best stepsizes
using separate validation data. We first investigate the performance of SBDA on nonsmooth
deterministic problems by comparing its performance against other nonsmooth algorithms. We compare
with the following algorithms: SM1 and SM2 are subgradient mirror descent methods with two different
stepsize rules, $\gamma_1$ and $\gamma_2$ respectively. Finally, SGD is stochastic mirror descent and SDA a
stochastic subgradient dual averaging method.

¹ See Corollary 2.2, part a) of [3].

Figure 1: Tests on $\ell_1$ regression.

We study the problem of robust linear regression
($\ell_1$ regression) with the objective $\phi(x) = \frac{1}{m} \sum_{i=1}^{m} |b_i - a_i^T x|$. The optimal solution $x^*$ and each $a_i$
are generated from $\mathcal{N}(0, I_{n \times n})$. In addition, we define a scaling vector $s \in \mathbb{R}^n$ and $S$ a diagonal
matrix s.t. $S_{ii} = s_i$. We let $b = (AS) x^* + \sigma$, where $A = [a_1, a_2, \ldots, a_m]^T \in \mathbb{R}^{m \times n}$ and the noise
$\sigma \sim \mathcal{N}(0, pI)$. We set $p = 0.01$ and $m = n = 5000$.
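The data generation described for the $\ell_1$ regression test can be sketched as follows. This is an illustrative reading, not the original experimental code: it assumes the noise has variance $p$, returns the scaled design $AS$ as the effective design matrix, and uses much smaller sizes than $m = n = 5000$. The function names are hypothetical.

```python
import numpy as np

def make_l1_regression(m, n, s, p=0.01, seed=0):
    """Generate (design, b, x_star) with b = (A S) x_star + noise."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))               # rows a_i ~ N(0, I)
    x_star = rng.standard_normal(n)               # x* ~ N(0, I)
    S = np.diag(s)                                # S_ii = s_i
    noise = np.sqrt(p) * rng.standard_normal(m)   # assumed: noise variance p
    b = (A @ S) @ x_star + noise
    return A @ S, b, x_star

def l1_objective(design, b, x):
    """phi(x) = (1/m) * sum_i |b_i - a_i^T x|."""
    return np.mean(np.abs(b - design @ x))

design, b, x_star = make_l1_regression(m=200, n=50, s=np.ones(50))
print(l1_objective(design, b, x_star), l1_objective(design, b, np.zeros(50)))
```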

We plot the optimization objective against the number of passes over the dataset in Figure 1 for
four different choices of $s$. In the first test case (leftmost subfigure), we let $s = [1, 1, \ldots, 1]^T$ so
that columns of $A$ correspond to uniform scaling. We find that SBDA-u and SBDA-r have slightly
better performance than the other algorithms while exhibiting very similar performance. In the
next three cases, $s$ is generated from the distribution $p(x; a) = a (1 - x)^{a-1}$, $0 \le x \le 1$, $a > 0$. We
set $a = 1, 5, 30$ respectively. Employing a large $a$ ensures that the bounds on the norms of block
subgradients follow the power law. We observe that stochastic methods outperform the deterministic
methods, and SBDA-based algorithms have comparable and often better performance than SGD
algorithms. In particular, SBDA-r exhibits the best performance, which clearly shows the advantage
of SBDA with the nonuniform sampling scheme.
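The density $a(1-x)^{a-1}$ on $[0, 1]$ is the Beta$(1, a)$ density, with cumulative distribution function $1 - (1-x)^a$, so scaling vectors of this kind can be drawn by inverting the CDF. The sketch below is one way to do this; the function name `sample_scaling` is hypothetical and the original experiments may have used a different sampler.

```python
import numpy as np

def sample_scaling(a, size, seed=0):
    """Draw from p(x; a) = a (1 - x)^(a - 1) on [0, 1] via the inverse CDF x = 1 - (1 - u)^(1/a)."""
    rng = np.random.default_rng(seed)
    u = rng.random(size)                  # uniform on [0, 1)
    return 1.0 - (1.0 - u) ** (1.0 / a)

for a in (1, 5, 30):
    s = sample_scaling(a, size=200_000)
    # The mean of this distribution is 1 / (1 + a); larger a concentrates mass near 0.
    print(a, s.mean(), 1.0 / (1.0 + a))
```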

Next, we examine the performance of SBDA for online learning and stochastic approximation.
We conduct simulated experiments on the problem $\phi(x) = \mathbb{E}_{a,b}\left[ (b - \langle La, x \rangle)^2 \right]$, where the aim is
to fit linear regression under a linear transform $L$. The transform matrix $L$ is generated
as follows: we first sample a matrix $\tilde{L}$ for which each entry $\tilde{L}_{ij} \sim \mathcal{N}(0, 1)$. $L$ is obtained from $\tilde{L}$
with 90% of the rows being randomly rescaled by a factor $p$. To obtain the optimal solution $x^*$, we
first generate a random vector from the distribution $\mathcal{N}(0, I_{n \times n})$ and then truncate each coordinate
to $[-1, 1]$. Simulated samples are generated according to $b = \langle La, x^* \rangle + \epsilon$ where $\epsilon \sim \mathcal{N}(0, 0.01 I_{n \times n})$.
We let $n = 200$, and generate 3000 independent samples for training and 10000 independent samples
for testing.

To compare the performance of these algorithms under various conditions, we vary the parameter
$p$ over $\{1, 0.1, 0.05, 0.01\}$. As can be seen from above, $p$ affects the estimation of block-wise parameters
$\{M_i\}$. In Figure 2 we show the objective function averaged over 20 runs. The experimental
results show the advantages of SBDA over SBMD. When $p = 1$, SBDA-u, SBDA-r, and SBMD have
the same theoretical convergence rate, and exhibit similar performance. However, as $p$ decreases, the
"importance" of 90% of the blocks diminishes and we find that SBDA-u and SBDA-r both outperform
SBMD. Moreover, SBDA-r seems to perform the best, suggesting the advantage of our proposed
stepsize and sampling schemes which are adaptive to the block structures. These observations lend
empirical support to our theoretical analysis.

Our next experiment considers online regularized linear regression (Lasso):

$$\min_{w \in \mathbb{R}^n} \frac{1}{2} \mathbb{E}_{(y, x)}\left[ (y - w^T x)^2 \right] + \lambda \|w\|_1. \tag{41}$$


While linear regression has been well studied in the literature, recent work is interested in efficient
regression algorithms under different adversarial circumstances HI 13 IIP) .

Figure 2: Tests on linear regression. Left to right: p = 1, 0.1, 0.05, 0.01.

(a) Test on covtype dataset. (b) Test on mnist dataset.

Figure 3: Tests on online lasso with limited budgets.

Under the assumption of limited budgets, the learner only partially observes the features for each
incoming instance, but is allowed to choose the sampling distribution of the features. In addition,
we explicitly enforce the $\ell_1$ penalty, expecting to learn a sparse solution that effectively reduces
testing cost. To apply stochastic methods, we estimate the stochastic coordinate gradient of the least
squares loss. For the sake of simplicity, we assume that for each input sample instance $(y, x)$, two
features $(i_t, j_t)$ are revealed. When we sample one coordinate $j_t$ from some distribution $\{p_j\}$, then
$\frac{1}{p_{j_t}} w^{(j_t)} x^{(j_t)}$ is an unbiased estimator of $w^T x$. Hence the defined value
$\left( \frac{1}{p_{j_t}} w^{(j_t)} x^{(j_t)} - y \right) x^{(i_t)}$ is an unbiased estimator of the $i_t$-th coordinate gradient.
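The unbiasedness claims in the limited-budget setting can be verified exactly by enumerating the sampled feature. The sketch below assumes the estimator of $w^T x$ is $w^{(j)} x^{(j)} / p_j$ with $j$ drawn from $\{p_j\}$, and that the coordinate gradient of $\frac{1}{2}(y - w^T x)^2$ with respect to $w^{(i)}$ is $(w^T x - y) x^{(i)}$. The function names and the random values of `w`, `x`, `y` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
w, x, y = rng.standard_normal(n), rng.standard_normal(n), 0.7
p = np.full(n, 1.0 / n)      # feature sampling distribution {p_j}; uniform in this example

def inner_estimate(j):
    """w_j x_j / p_j: an unbiased estimate of w^T x when j is drawn with probability p_j."""
    return w[j] * x[j] / p[j]

def coord_grad_estimate(i, j):
    """Estimate of the i-th coordinate gradient (w^T x - y) x_i using only features i and j."""
    return (inner_estimate(j) - y) * x[i]

# Exact expectations over j, computed by enumeration instead of sampling.
exp_inner = sum(p[j] * inner_estimate(j) for j in range(n))
exp_grad = np.array([sum(p[j] * coord_grad_estimate(i, j) for j in range(n)) for i in range(n)])
print(np.isclose(exp_inner, w @ x), np.allclose(exp_grad, (w @ x - y) * x))
```

Only two features per instance are touched by `coord_grad_estimate`, which is what makes the estimator compatible with the budget constraint.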

We adapt both SBMD and SBDA-u to these problems and conduct the experiments on the datasets
covtype and mnist (digit "3 vs 5"). We also implement MD (composite mirror descent) and DA
(regularized dual averaging method). For all the methods, the training uses the same total number
of features. However, SBMD and SBDA-u obtain features sampled using a uniform distribution; both
MD and DA have "unfair" access to full feature vectors and therefore have the advantage
of lower variance. We plot in Figures 3a and 3b the optimization error and sparsity patterns with
respect to the penalty weights $\lambda$ on the two datasets. SBDA-u has comparable
and often better optimization accuracy than SBMD. The sparsity patterns
for different values of $\lambda$ show that SBDA-u is very effective in enhancing sparsity: it is more
efficient than SBMD and MD, and comparable to DA, which does not have such budget constraints.

6 Discussion 

In this paper we introduced SBDA, a new family of block subgradient methods for nonsmooth and
stochastic optimization, based on a novel extension of dual averaging methods. We specialized
SBDA-u for regularized problems with nonsmooth or strongly convex regularizers, and SBDA-r
for general nonsmooth problems. We proposed novel randomized stepsizes and optimal sampling
schemes which are truly block adaptive, and thereby obtained a set of sharper bounds. Experiments
demonstrate the advantage of SBDA methods compared with subgradient methods on nonsmooth
deterministic and stochastic optimization. In the future, we will extend SBDA to an important
class of regularized learning problems consisting of the finite sum of differentiable losses. On such
problems, recent work [31, 32] shows efficient BCD convergence at a linear rate. The works in [39, 35]
propose randomized BCD methods that sample both primal and dual variables. However, both
methods apply conservative stepsizes which take the maximum of the block Lipschitz constants. It
would be interesting to see whether our techniques of block-wise stepsizes and nonuniform sampling
can be applied in this setting as well to obtain improved performance.

7 Appendix 

Proof of Lemma 1

Proof. The first part comes from Let g (z) denote any subgradient of / at 2 ;. Since f (x) is 

strongly convex, we have / (x) > f (z) + {z) ,x — z) + ^\\x — z\\‘^.y By the definition of z and 

optimality condition, we have (z) = —'S/id{z). Thus 

f{x) + {Vid{z),x- z) > fiz) + ^||x - Z\\yy 

It remains to apply the definition x = z + Uiy and V (z, x) = d (x) — d{z) — (Vd (z) ,x — z). □ 

Proof of Lemma [2] 

Proof. Let h{y) = max^^x {{v, x) — T (x)}, since T (•) is strongly convex and separable, /i (•) is 
convex and differentiable and its i-th block gradient Vj/i (•) is L.gniooth . Moreover, we have 
V/i (0) = xo by the definition of xq. Thus 

h < /i(0) 

It remains to plug in the definition of h (•), z, Xq- □ 

Proof of Lemma [3] 

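Proof. By convexity of $f(\cdot)$, we have $f(z) \le f(x) + \langle g(z), z - x \rangle$. In addition,

$$\begin{aligned} \langle g(z), z - x \rangle &= \langle g(x), z - x \rangle + \langle g(z) - g(x), z - x \rangle \\ &= \langle g^{(i)}(x), y \rangle_{(i)} + \langle g^{(i)}(z) - g^{(i)}(x), y \rangle_{(i)} \\ &\le \langle g^{(i)}(x), y \rangle_{(i)} + \|g^{(i)}(z) - g^{(i)}(x)\|_{(i),*} \cdot \|y\|_{(i)}. \end{aligned}$$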

The second equation follows from the relation between x, y, z and the last one from the Cauchy- 
Schwarz inequality. Finally the conclusion directly follows from 

Proof of Lemma [5] 

Proof. Let At = ^1^3=0 ~ Ss=i equivalent to show At < Bt + Then 
At


pBt + Ao + (1 -p) At-i 

[p + (1 - p)] \pBt-i + ^o] + (1 - At-2 


p+{l-p) + (1 -p)^ 


\pBt-i + ^o] + (1 - pf At-3 


^ \pBt + ^o] 


t 




The last inequality follows from the assumption that Bt > Bg where 0 < s < t and Aq = oq- It 
remains to apply the inequality “ pY — “ pT ~ 


Proof of Lemma 6

Proof. If ri, r 2 , rs ~ Bernoulli (p), c > 0, 0 < p < 1, 


E 


rix + r 2 (a — x) + 5 


(1 -P) P(1 -P) P(1 -P) P 


a — X + b X + b 


a + b 


^ (1 -P) P(1 -P) P(1 -P) P 


b 

l-p 


a + b 


a + b 


+ 


P 


= E 


b a + b 

1 


rga + b 


A I _^ 

r+c a-x-l 

convex in [0, a], then max^g[o,a] / (x) = max {/ (0), / (a)}. 


To see the first inequality, let / (x) = , where A,B > 0, it can be seen that /(•) is 

□ 


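Proof of Lemma 10

Proof. Let $x^*, y^*$ be the optimal solution of $\min_{x, y} \mathcal{L}(x, y, a, b)$. We consider two subproblems.
Firstly, $x^* = \arg\min_x \mathcal{L}(x, y^*, a, b)$. Since $\frac{a_i}{x_i} + \frac{b_i x_i}{y_i^*} \ge 2\sqrt{\frac{a_i b_i}{y_i^*}}$, at optimality

$$\frac{a_i}{x_i^*} = \frac{b_i x_i^*}{y_i^*}. \tag{42}$$

On the other hand, $y^*$ is the minimizer of the problem $\min_y \mathcal{L}(x^*, y, a, b)$. Applying the Cauchy-
Schwarz inequality to $\mathcal{L}(x^*, y, a, b)$, we obtain

$$\left( \sum_{i=1}^{n} \sqrt{b_i x_i^*} \right)^2 \le \left( \sum_{i=1}^{n} \frac{b_i x_i^*}{y_i} \right) \left( \sum_{i=1}^{n} y_i \right) = \sum_{i=1}^{n} \frac{b_i x_i^*}{y_i}.$$

At optimality, the equality holds for some scalar $C > 0$,

$$\frac{b_i x_i^*}{y_i^*} = C y_i^*, \quad i = 1, 2, \ldots, n. \tag{43}$$

It remains to solve the equations (42) and (43) with the simplex constraint on $y$.

□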
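For completeness, the final step can be worked out explicitly. The derivation below is a sketch that assumes the objective is $\sum_i \left( a_i / x_i + b_i x_i / y_i \right)$ and that the two optimality conditions read $a_i / x_i^* = b_i x_i^* / y_i^*$ and $b_i x_i^* / y_i^* = C y_i^*$; under those assumptions it recovers the closed form $y_i^* = (a_i b_i)^{1/3} W$ and $x_i^* = a_i^{2/3} b_i^{-1/3} \sqrt{W}$.

```latex
\begin{align*}
\frac{a_i}{x_i^*} = \frac{b_i x_i^*}{y_i^*}
  &\;\Longrightarrow\; (x_i^*)^2 = \frac{a_i y_i^*}{b_i}, \\
\frac{b_i x_i^*}{y_i^*} = C y_i^*
  &\;\Longrightarrow\; b_i^2 (x_i^*)^2 = C^2 (y_i^*)^4
  \;\Longrightarrow\; a_i b_i\, y_i^* = C^2 (y_i^*)^4
  \;\Longrightarrow\; y_i^* = (a_i b_i)^{1/3} C^{-2/3}, \\
\sum_{i=1}^{n} y_i^* = 1
  &\;\Longrightarrow\; C^{-2/3} = \Big( \sum_{j=1}^{n} (a_j b_j)^{1/3} \Big)^{-1} = W, \\
x_i^* = \sqrt{\frac{a_i y_i^*}{b_i}}
  &= \sqrt{\frac{a_i (a_i b_i)^{1/3} W}{b_i}} = a_i^{2/3} b_i^{-1/3} \sqrt{W}.
\end{align*}
```

With $a_i = (T+1) M_i^2 / (2\rho)$, $b_i = D_i$, $x = \beta$ and $y = p$, this gives $p_i \propto M_i^{2/3} D_i^{1/3}$.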